12 Regular expressions
12.1 Regular expressions
Regular expressions (or ‘regex’) help us find more complex patterns in strings of text. Suppose we are interested in finding all inflectional forms of the lemma PROVIDE in a corpus, i.e., provide, provides, providing and provided. Instead of searching for each form individually, we can construct a regular expression of the form
\[ \text{provid(es|ing|ed)?} \]
which can be read as ‘Match the sequence of letters <provid>, optionally followed by the letters <es> or <ing> or <ed>’. Notice how optionality is signified by the ‘?’ operator and alternatives by ‘|’. Because the pattern is not anchored, it matches every token that contains this sequence, which is why provide is found as well.
To activate regular expressions in a KWIC query, simply set the valuetype argument to "regex":
# Load library and corpus
library(quanteda)
ICE_GB <- readRDS("ICE_GB.RDS")
# Perform query
kwic_provide <- kwic(ICE_GB,
                     phrase("provid(es|ing|ed)?"),
                     valuetype = "regex",
                     window = 20)
The number of hits has more than doubled. However, upon closer inspection, we’ll notice a few false positives, such as providential, provider and providers:
table(kwic_provide$keyword)
provid provide provided Provided Provident providential
1 165 118 5 1 1
provider providers provides providing Providing
1 3 72 52 1
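The false positives arise because the pattern is not anchored: a token counts as a hit as soon as it contains the sequence provid. The following is a minimal sketch in base R rather than quanteda. It assumes that grepl() with ignore.case = TRUE is a reasonable stand-in for the case-insensitive regex matching that kwic() applies by default. The vector forms is simply copied from the table above.

```r
# Keyword forms copied from the frequency table above
forms <- c("provid", "provide", "provided", "Provided", "Provident",
           "providential", "provider", "providers", "provides",
           "providing", "Providing")

# The same pattern as in the query; nothing ties it to the start or end of a word
pattern <- "provid(es|ing|ed)?"

# Assumption: ignore.case = TRUE mimics the default case-insensitive matching of kwic()
hits <- grepl(pattern, forms, ignore.case = TRUE, perl = TRUE)

# Every form is a hit, including providential, provider and providers
forms[hits]
```

Since the optional group may be empty, the bare sequence provid is enough for a match, so the group does nothing to exclude longer words.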
There are two ways to handle this:
- Refine the search expression further to only match those cases of interest.
- Manually sort out irrelevant cases during annotation in your spreadsheet software.
As a rule of thumb, you should consider improving your search expression if you receive hundreds or even thousands of false hits. If there are only a couple of false positives, it’s usually easier to simply mark them as “irrelevant” in your spreadsheet.
How could you refine the search expression for PROVIDE to get rid of the irrelevant cases? Consult the RegEx Cheatsheet below!
Solution:
# Add word boundary with \\b
kwic_provide2 <- kwic(ICE_GB,
phrase("\\bprovid(e|es|ing|ed)\\b"),
valuetype = "regex",
window = 20)
table(kwic_provide2$keyword)
12.2 A RegEx Cheatsheet
12.2.1 Basic functions
| Command | Definition | Example | Finds |
|---|---|---|---|
| `python` | python | | |
| `.` | Any character | `.ython` | aython, bython… |
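The two basic commands can be tried out on a small word list. This is a sketch in base R with a hypothetical vector words; note that grepl() is case-sensitive by default, whereas the kwic() query above is not.

```r
# Hypothetical test words (not taken from the corpus)
words <- c("python", "jython", "Python", "pythons", "thon")

# A literal string matches wherever that exact sequence occurs
literal_hits <- grepl("python", words, perl = TRUE)
words[literal_hits]

# The dot stands for any single character before "ython"
dot_hits <- grepl(".ython", words, perl = TRUE)
words[dot_hits]
```

The word thon is never matched, because the dot requires exactly one character to be present before ython.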
12.2.2 Character classes and alternatives
| Command | Definition | Example | Finds |
|---|---|---|---|
| `[abc]` | Class of characters | `[jp]ython` | jython, python |
| `[^pP]` | Excluded class of characters | `[^pP]ython` | any other character followed by ython, e.g. jython (but not python or Python) |
| `(...\|...)` | Alternatives linked by the logical operator OR | `P(ython\|eter)` | Python, Peter |
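A short sketch of how character classes and alternatives behave, again in base R on a hypothetical word list:

```r
# Hypothetical test words (not taken from the corpus)
words <- c("python", "Python", "jython", "Peter", "Pythagoras")

# [jp] accepts either j or p in the first position
class_hits <- grepl("[jp]ython", words, perl = TRUE)
words[class_hits]

# [^pP] accepts any character except p and P
negated_hits <- grepl("[^pP]ython", words, perl = TRUE)
words[negated_hits]

# (ython|eter) accepts either of the two continuations after P
alt_hits <- grepl("P(ython|eter)", words, perl = TRUE)
words[alt_hits]
```

Pythagoras is not matched by the last pattern, because after P it continues with ytha rather than ython or eter.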
12.2.3 Pre-defined character classes
| Command | Definition | Example | Finds |
|---|---|---|---|
| `\\w` | Any alphanumeric character or the underscore | A-Z, a-z, 0-9, _ | |
| `\\W` | Any character that is not alphanumeric or an underscore | everything but A-Z, a-z, 0-9, _ | |
| `\\d` | Any decimal digit | 0-9 | |
| `\\D` | Any character that is not a decimal digit | everything but 0-9 | |
| `\\s` | Whitespace (space, tab, line break) | | |
| `\\b` | Word boundary | `\\bpython\\b` | Matches python as a whole word |
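The pre-defined classes can be checked in the same way. The sketch below uses base R with two hypothetical vectors; the last call applies the word-boundary pattern from the solution above to a few of the PROVIDE forms.

```r
# Hypothetical test strings (not taken from the corpus)
x <- c("python", "python3", "pythons", "monty python", "2024")

# \\d: does the string contain a digit?
digit_hits <- grepl("\\d", x, perl = TRUE)

# \\s: does the string contain whitespace?
space_hits <- grepl("\\s", x, perl = TRUE)

# \\b: python must not be directly preceded or followed by a letter, digit or underscore
word_hits <- grepl("\\bpython\\b", x, perl = TRUE)
x[word_hits]

# With boundaries, provider and providential are no longer matched
forms <- c("provide", "provides", "provider", "providential")
bounded_hits <- grepl("\\bprovid(e|es|ing|ed)\\b", forms, perl = TRUE)
forms[bounded_hits]
```

Note that python3 is rejected by `\\bpython\\b`: a digit counts as a word character, so there is no boundary between n and 3.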
12.2.4 Quantifiers
| Command | Definition | Example | Finds |
|---|---|---|---|
| `?` | One or zero instances of the preceding symbol | `Py?thon` | Python, Pthon |
| `*` | Any number of instances of the preceding symbol, including zero | `Py*thon` | Python, Pthon, Pyyyython… |
| | | `P[Yy]*thon` | Python, Pthon, PyYYython… |
| `+` | Any number of instances of the preceding symbol, but at least one | `Py+thon` | Python, Pyyython, Pyyyython |
| `{1,3}` | Between min and max instances of the preceding symbol, written {min,max} | `Py{1,3}thon` | Python, Pyython, Pyyython |
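The difference between the quantifiers is easiest to see when all of them are applied to the same strings. A minimal sketch in base R, using a hypothetical vector with zero to four instances of y:

```r
# Hypothetical test strings with 0, 1, 2, 3 and 4 instances of y
x <- c("Pthon", "Python", "Pyython", "Pyyython", "Pyyyython")

# ?      zero or one y
optional_hits <- grepl("Py?thon", x, perl = TRUE)

# *      zero or more y
star_hits <- grepl("Py*thon", x, perl = TRUE)

# +      one or more y
plus_hits <- grepl("Py+thon", x, perl = TRUE)

# {1,3}  between one and three y
range_hits <- grepl("Py{1,3}thon", x, perl = TRUE)

# One row per string, one column per quantifier
data.frame(x, optional_hits, star_hits, plus_hits, range_hits)
```

Each quantifier applies only to the symbol directly in front of it, here y. To quantify a longer sequence, wrap it in parentheses first.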
12.3 Exercises
- Find all names of months!
- Write an elegant regular expression which finds sing, sang and sung.
- Find all four-digit numbers in the corpus!
- Write an elegant regular expression which finds all inflectional forms of swim!